Skip to content

fix: preserve source id during sync imports#541

Closed
100yenadmin wants to merge 1 commit into
garrytan:garrytan/embedding-providersfrom
electricsheephq:fix/source-id-sync-imports
Closed

fix: preserve source id during sync imports#541
100yenadmin wants to merge 1 commit into
garrytan:garrytan/embedding-providersfrom
electricsheephq:fix/source-id-sync-imports

Conversation

@100yenadmin
Copy link
Copy Markdown
Contributor

@100yenadmin 100yenadmin commented Apr 30, 2026

Summary

Fixes named-source sync/import writing pages and related import state to default instead of the requested sources.id.

Root cause

sync --source <id> was only source-scoping sync bookkeeping. The import/write path did not receive sourceId, so page writes fell back to the schema/default source, and slug-based page/chunk/tag lookups were source-blind.

Changes

  • Thread optional sourceId through sync/import/import-file paths.
  • Add optional source scoping for page, chunk, tag, and version operations used by import.
  • Preserve existing default behavior: callers that omit sourceId still target source_id='default'.
  • Implement both PGLite and Postgres engine paths.

Validation

  • bun run typecheck
  • Local OpenClaw Support KB source sync repaired from openclaw-support-kb: 0 pages to openclaw-support-kb: 635 pages.
  • Source-scoped support KB status/search now passes against --source openclaw-support-kb.

Fixes #540.


View in Codesmith
Need help on this PR? Tag @codesmith with what you need.

  • Let Codesmith autofix CI failures and bot reviews

Copilot AI review requested due to automatic review settings April 30, 2026 19:18
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes multi-source sync/import so that sync --source <id> correctly writes and reconciles imported pages (and related chunks/tags/versions) under the requested sources.id instead of accidentally falling back to the default source.

Changes:

  • Threads optional sourceId through sync → import → import-file paths.
  • Extends engine CRUD used by import (pages/chunks/tags/versions) to accept optional sourceId and scope DB lookups/writes.
  • Updates both Postgres and PGLite engine implementations to respect the source scope.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/core/types.ts Adds optional source_id to PageInput to allow per-page source scoping.
src/core/engine.ts Extends BrainEngine CRUD signatures to accept optional sourceId for source-scoped operations.
src/core/postgres-engine.ts Implements source-scoped page/chunk/tag/version operations for Postgres.
src/core/pglite-engine.ts Implements source-scoped page/chunk/tag/version operations for PGLite.
src/core/import-file.ts Threads sourceId through import pipeline and passes it into engine operations during import.
src/commands/sync.ts Passes opts.sourceId into import calls during sync, including full-sync import.
src/commands/import.ts Passes optional sourceId through to importFile during directory import.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/core/engine.ts
Comment on lines 115 to +135
@@ -131,8 +131,8 @@ export interface BrainEngine {
getEmbeddingsByChunkIds(ids: number[]): Promise<Map<number, Float32Array>>;

// Chunks
upsertChunks(slug: string, chunks: ChunkInput[]): Promise<void>;
getChunks(slug: string): Promise<Chunk[]>;
upsertChunks(slug: string, chunks: ChunkInput[], sourceId?: string): Promise<void>;
getChunks(slug: string, sourceId?: string): Promise<Chunk[]>;
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With getPage/getChunks/getTags/... now defaulting to source_id='default' when sourceId is omitted, the interface becomes inconsistent with listPages() / getAllSlugs() (which currently return entries across all sources). Callers that iterate via listPages() and then call getChunks(slug) / getTags(slug) (e.g. embed/export flows) will silently operate only on the default source for non-default pages. Consider adding source scoping to listPages/getAllSlugs (or including source_id in the returned Page shape) so multi-source callers can preserve source context end-to-end.

Copilot uses AI. Check for mistakes.
Comment thread src/core/import-file.ts
Comment on lines 184 to 291
export async function importFromContent(
engine: BrainEngine,
slug: string,
content: string,
opts: { noEmbed?: boolean } = {},
opts: { noEmbed?: boolean; sourceId?: string } = {},
): Promise<ImportResult> {
// Reject oversized payloads before any parsing, chunking, or embedding happens.
// Uses Buffer.byteLength to count UTF-8 bytes the same way disk size would,
// so the network path behaves identically to the file path.
const byteLength = Buffer.byteLength(content, 'utf-8');
if (byteLength > MAX_FILE_SIZE) {
return {
slug,
status: 'skipped',
chunks: 0,
error: `Content too large (${byteLength} bytes, max ${MAX_FILE_SIZE}). Split the content into smaller files or remove large embedded assets.`,
};
}

const parsed = parseMarkdown(content, slug + '.md');

// Hash includes ALL fields for idempotency (not just compiled_truth + timeline)
const hash = createHash('sha256')
.update(JSON.stringify({
title: parsed.title,
type: parsed.type,
compiled_truth: parsed.compiled_truth,
timeline: parsed.timeline,
frontmatter: parsed.frontmatter,
tags: parsed.tags.sort(),
}))
.digest('hex');

const parsedPage: ParsedPage = {
type: parsed.type,
title: parsed.title,
compiled_truth: parsed.compiled_truth,
timeline: parsed.timeline || '',
frontmatter: parsed.frontmatter,
tags: parsed.tags,
};

const existing = await engine.getPage(slug);
const existing = await engine.getPage(slug, opts.sourceId);
if (existing?.content_hash === hash) {
return { slug, status: 'skipped', chunks: 0, parsedPage };
}

// Chunk compiled_truth and timeline
const chunks: ChunkInput[] = [];
if (parsed.compiled_truth.trim()) {
for (const c of chunkText(parsed.compiled_truth)) {
chunks.push({ chunk_index: chunks.length, chunk_text: c.text, chunk_source: 'compiled_truth' });
}
}
if (parsed.timeline?.trim()) {
for (const c of chunkText(parsed.timeline)) {
chunks.push({ chunk_index: chunks.length, chunk_text: c.text, chunk_source: 'timeline' });
}
}

// v0.20.0 Cathedral II Layer 8 D2 — extract fenced code blocks from
// compiled_truth as first-class code chunks.
if (parsed.compiled_truth.trim()) {
const fenceChunks = await extractFencedChunks(parsed.compiled_truth, chunks.length);
chunks.push(...fenceChunks);
}

// Embed BEFORE the transaction (external API call).
// v0.14+ (Codex C2): embedding failure PROPAGATES. Silent drop accumulates
// unembedded pages invisibly. Caller can pass opts.noEmbed=true to skip.
if (!opts.noEmbed && chunks.length > 0) {
const embeddings = await embedBatch(chunks.map(c => c.chunk_text));
for (let i = 0; i < chunks.length; i++) {
chunks[i].embedding = embeddings[i];
chunks[i].token_count = Math.ceil(chunks[i].chunk_text.length / 4);
}
}

// Transaction wraps all DB writes
await engine.transaction(async (tx) => {
if (existing) await tx.createVersion(slug);
if (existing) await tx.createVersion(slug, opts.sourceId);

await tx.putPage(slug, {
type: parsed.type,
title: parsed.title,
compiled_truth: parsed.compiled_truth,
timeline: parsed.timeline || '',
frontmatter: parsed.frontmatter,
content_hash: hash,
source_id: opts.sourceId,
});

// Tag reconciliation: remove stale, add current
const existingTags = await tx.getTags(slug);
const existingTags = await tx.getTags(slug, opts.sourceId);
const newTags = new Set(parsed.tags);
for (const old of existingTags) {
if (!newTags.has(old)) await tx.removeTag(slug, old);
if (!newTags.has(old)) await tx.removeTag(slug, old, opts.sourceId);
}
for (const tag of parsed.tags) {
await tx.addTag(slug, tag);
await tx.addTag(slug, tag, opts.sourceId);
}

if (chunks.length > 0) {
await tx.upsertChunks(slug, chunks);
await tx.upsertChunks(slug, chunks, opts.sourceId);
} else {
// Content is empty — delete stale chunks so they don't ghost in search results
await tx.deleteChunks(slug);
await tx.deleteChunks(slug, opts.sourceId);
}
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change introduces source-scoped import behavior, but there doesn’t appear to be a regression test asserting that importing with sourceId writes/updates pages, tags, chunks, and versions under that source (and not under default). Given the existing test coverage around import/sync and multi-source behavior, adding a focused test here would prevent this bug from recurring (ideally covering both PGLite and Postgres paths if the test suite supports it).

Copilot uses AI. Check for mistakes.
const sql = this.sql;
const hash = page.content_hash || contentHash(page);
const frontmatter = page.frontmatter || {};
const sourceId = page.source_id || 'default';
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

putPage() uses page.source_id || 'default', which will silently treat an empty-string source id as the default source. That can reintroduce the “writes land under default” class of bugs if an upstream caller accidentally passes '' (or any other falsy value) instead of a real source id. Prefer nullish-coalescing (??) and/or validate that source_id is a non-empty string so invalid source IDs fail fast rather than falling back to default.

Suggested change
const sourceId = page.source_id || 'default';
const sourceId = page.source_id ?? 'default';
if (sourceId.trim().length === 0) {
throw new Error('page.source_id must be a non-empty string');
}

Copilot uses AI. Check for mistakes.
Comment thread src/core/pglite-engine.ts
slug = validateSlug(slug);
const hash = page.content_hash || contentHash(page);
const frontmatter = page.frontmatter || {};
const sourceId = page.source_id || 'default';
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

putPage() uses page.source_id || 'default', which will silently treat an empty-string source id as the default source. That can reintroduce the “writes land under default” class of bugs if an upstream caller accidentally passes '' (or any other falsy value) instead of a real source id. Prefer nullish-coalescing (??) and/or validate that source_id is a non-empty string so invalid source IDs fail fast rather than falling back to default.

Suggested change
const sourceId = page.source_id || 'default';
const rawSourceId = page.source_id;
if (rawSourceId !== undefined && rawSourceId !== null && (typeof rawSourceId !== 'string' || rawSourceId.trim() === '')) {
throw new Error('page.source_id must be a non-empty string when provided');
}
const sourceId = rawSourceId ?? 'default';

Copilot uses AI. Check for mistakes.
Comment thread src/core/pglite-engine.ts
FROM pages WHERE slug = $1 AND source_id = $2
RETURNING *`,
sourceId === undefined ? [slug] : [slug, sourceId]
);
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

createVersion() returns rows[0] without checking whether the INSERT…SELECT matched a page. When the slug (or slug+source) doesn’t exist, this will return undefined while the method is typed as Promise<PageVersion>, leading to downstream runtime errors. Mirror the PostgresEngine behavior here by throwing a clear error when rows.length === 0.

Suggested change
);
);
if (rows.length === 0) {
throw new Error(
sourceId === undefined
? `Page not found for slug "${slug}" and source_id "default"`
: `Page not found for slug "${slug}" and source_id "${sourceId}"`
);
}

Copilot uses AI. Check for mistakes.
@100yenadmin
Copy link
Copy Markdown
Contributor Author

Superseded by #639, rebuilt cleanly on current master with the same issue target (#540) and fresh validation. Closing this stale/conflicting branch to keep review focused.

@100yenadmin 100yenadmin closed this May 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants